Abstract

Background: At the moment, there are a number of publications describing gene expression profiling in virus-infected
plants. Most of the data are limited to specific host-pathogen interactions involving a given virus and a
model host plant, usually Arabidopsis thaliana. Even though several summarizing attempts have been made, a
general picture of gene expression changes in susceptible virus-host interactions is lacking.

Methods: To analyze transcriptome response to virus infection, we have assembled currently available microarray
data on changes in gene expression levels in compatible Arabidopsis-virus interactions. We used the mean r
(Pearson's correlation coefficient) for neighboring pairs to estimate pairwise local similarity in expression in the
Arabidopsis genome.

Results: Here we provide a functional classification of genes with altered expression levels. We also demonstrate 
that responsive genes may be grouped or clustered based on their co-expression pattern and chromosomal 
location. 

Conclusions: In summary, we found that there is a greater variety of upregulated genes in the course of viral 
pathogenesis as compared to repressed genes. Distribution of the responsive genes in combined viral databases 
differed from that of the whole Arabidopsis genome, thus underlining a role of the specific biological processes in 
common mechanisms of general resistance against viruses and in physiological/cellular changes caused by 
infection. Using integrative platforms for the analysis of gene expression data and functional profiling, we identified 
overrepresented functional groups among activated and repressed genes. Each virus-host interaction is unique in 
terms of the genes with altered expression levels and the number of shared genes affected by all viruses is very 
limited. At the same time, common genes can participate in virus-, fungi- and bacteria-host interaction. According 
to our data, non-homologous genes that are located in close proximity to each other on the chromosomes, and 
whose expression profiles are modified as a result of the viral infection, occupy 12% of the genome. Among them 
5% form co-expressed and co-regulated clusters. 
Background

Viruses are among the most agriculturally important 
groups of plant pathogens, causing serious economic 
losses in many major crops by reducing yield and quality 
[1]. Although viruses have relatively simple genetic
structure, the detailed mechanisms of their interaction
with host plants and the means by which they manipulate a
plant's physiology toward their needs and trigger
antiviral responses in hosts are still not well defined [2-5].
Among the most important consequences of viral
pathogenesis are changes in expression of host genes
that define both the fate of the virus and the host's
survival chances. If plants are capable of efficiently fighting
infections by inherited genetic tools, such as resistance 
(R) genes that are abundant in every plant species [6], 
they immediately initiate general resistance pathways 
leading to a hypersensitive response (HR). In susceptible 
plants lacking R genes to a specific viral pathogen, 
viruses induce a variety of responses to prime and ele- 
vate their infections. These include expression changes 
associated with cellular processes redirected by viruses 
for their demands and host defensive reactions to the 
pathogenesis [3]. Understanding the balance and
interplay between these two types of responses would shed
light on the poorly characterized molecular mechanisms by
which viruses control the host immune system and
on the counteracting host signaling pathways. It would also
help to explain the continuous and interconnected genetic
variability in viral and host populations, that is, the
co-evolution of plants and viruses.

At the moment, there are a number of publications 
describing gene expression profiling in virus-infected 
plants that are derived mostly from DNA microarrays. 
They indicate a significant impact of viral infection on a 
wide array of cellular processes [7]. Usually, altered 
functional categories include responses to biotic and 
abiotic stresses, changes in basal plant metabolism, pro- 
tein synthesis, developmental and photosynthetic pro- 
cesses [7-10]. 

Most of the data are limited to specific host-pathogen 
interactions involving a given virus and a model host 
plant, which usually is Arabidopsis thaliana. Although
several efforts have been made to summarize general
changes in plant gene expression (due to viral, bacterial
and fungal infections, insect attack, and other biotic and
abiotic stresses) [3-5,11], a general picture of gene
expression changes in susceptible virus-host interactions is
missing. Detailed knowledge about the groups of host
genes participating in and/or responsive to viral patho- 
genesis may lead to new assertions on how host cells are 
controlled by infection, which defense and stress 
mechanisms are deployed, and why disease symptoms or 
deviation from normal in the growth of a plant are 
developed [1,5]. 



In this work, in order to analyze transcriptome
response to virus infection, we have assembled currently
available microarray data on changes in gene expression
levels in compatible Arabidopsis-virus interactions and
attempted to create a functional classification of the
genes with altered expression levels. We conclude that
each virus-host interaction is unique in terms of the 
genes with altered expression levels and the number of 
shared genes affected by all viruses is very limited. Import- 
antly, we also demonstrate that responsive genes may be 
grouped or clustered based on their co-expression pattern 
and chromosomal location. 

Methods 

Data source, microarray data 

Arabidopsis expression data were obtained from the
Nottingham Arabidopsis Stock Centre (NASC) microarray
database, ArrayExpress at the European Bioinformatics
Institute and the Gene Expression Omnibus database.
Additional data were retrieved from supplementary
material of published papers [see Additional file 1]. The
data sets were log transformed (when needed) and
significant genes were selected according to P < 0.05. We
selected only those genes that significantly changed their 
expression level in response to pathogen attack by at 
least two-fold. Tandem duplicates were removed from 
the resulting profile. The total number of collected genes 
across all experiments was 52488. These data represent 
44 experiments with 3 different types of pathogens: virus, 
bacteria and fungus. Among them there were 11 viruses 
and the total number of genes with significantly altered 
expression elicited by these viruses was 16816. This 
number included many identical genes (with the same 
ID) recorded in different experiments. After subtraction 
of the repeating genes, a list of 7639 unique genes was 
obtained. From the same data set we obtained 17734
genes for bacteria (11409 unique genes) and 15426 genes
for fungi (11047 unique genes).
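
The selection steps described above (log transformation, P < 0.05, at least two-fold change, and collapsing repeated gene IDs) can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the record layout (gene ID, fold change, P value), the function names and the toy values are all assumptions made for the example.

```python
import math

def select_responsive(records, p_cutoff=0.05, min_fold=2.0):
    """Keep genes with P < p_cutoff and at least a min_fold change.

    Each record is (gene_id, fold_change, p_value), where fold_change is the
    infected/control expression ratio (a hypothetical input layout).
    """
    threshold = math.log2(min_fold)
    selected = []
    for gene_id, fold_change, p_value in records:
        log_ratio = math.log2(fold_change)  # log transformation
        if p_value < p_cutoff and abs(log_ratio) >= threshold:
            direction = "up" if log_ratio > 0 else "down"
            selected.append((gene_id, direction))
    return selected

def unique_gene_ids(selected_per_experiment):
    """Collapse gene IDs that were recorded in more than one experiment."""
    unique = set()
    for selected in selected_per_experiment:
        unique.update(gene_id for gene_id, _ in selected)
    return unique

# Toy experiments with placeholder values (not taken from the data sets).
exp_a = [("AT1G01010", 4.0, 0.01), ("AT1G01020", 1.5, 0.01), ("AT1G01030", 0.25, 0.20)]
exp_b = [("AT1G01010", 0.4, 0.03), ("AT1G01040", 3.0, 0.04)]

selected = [select_responsive(exp_a), select_responsive(exp_b)]
total = sum(len(s) for s in selected)   # all records, repeats included
unique = unique_gene_ids(selected)      # repeats removed
```

In this sketch `total` plays the role of the pooled count (16816 for viruses) and `len(unique)` the role of the reduced list (7639 unique genes).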

Data analysis 

We performed a meta-analysis of all the collected data
on compatible virus-host interactions and also on the
whole database representing viral, bacterial and fungal
interactions with the host plant. We used tools from
TAIR to search GO annotations and functionally classify
Arabidopsis genes. To find over-represented functional
groups among activated or repressed genes during
virus-host interactions we used Babelomics 4 FatiGO [12] and
SEA from agriGO [13]. To visualize these data we used
REViGO software [14].

Clustering pathogen related genes 

The level of co-expression between two genes was
defined as the Pearson's correlation coefficient (r) of the
expression level for these genes. To test for pairwise local
similarity in expression in the Arabidopsis genome, the
mean r of the expression profiles for neighboring pairs of
genes was calculated [15].

The mean r calculated from the real data set was then 
compared with the mean r calculated from 1000 data 
sets in which the order of genes in the Arabidopsis
genome was randomized. The randomized gene orders were
generated with a uniform random number generator. The
proportion of
genes found in clusters and the size distribution of clus- 
ters were calculated, and the values were averaged for 
1000 iterations. 
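
The randomization procedure can be illustrated with a short sketch: compute the mean r over neighboring gene pairs, then compare it with the mean r obtained after shuffling the gene order 1000 times. The function names, the toy profiles, the fixed seed and the empirical P value are assumptions added for illustration; a single chromosome is assumed, so chromosome boundaries are ignored.

```python
import random
from statistics import mean

def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / (var_x * var_y) ** 0.5

def mean_neighbor_r(profiles):
    """Mean r over neighboring pairs; profiles are in chromosomal order."""
    return mean(pearson_r(a, b) for a, b in zip(profiles, profiles[1:]))

def permutation_test(profiles, iterations=1000, seed=0):
    """Compare the observed mean r with the mean r of shuffled gene orders."""
    rng = random.Random(seed)
    observed = mean_neighbor_r(profiles)
    shuffled = list(profiles)
    null = []
    for _ in range(iterations):
        rng.shuffle(shuffled)  # uniform random reordering of genes
        null.append(mean_neighbor_r(shuffled))
    p_value = sum(r >= observed for r in null) / iterations
    return observed, mean(null), p_value

# Toy profiles: three rising genes followed by three falling genes.
profiles = [[1, 2, 3, 4], [2, 4, 6, 8], [1, 3, 5, 7],
            [4, 3, 2, 1], [8, 6, 4, 2], [7, 5, 3, 1]]
observed, null_mean, p_value = permutation_test(profiles)
```

An observed mean r well above the average of the shuffled orders indicates that neighboring genes are more similar in expression than expected by chance.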

Results and discussion 

Broad changes in gene expression during susceptible 
virus-host interactions 

To analyze plant response to virus infection, we have 
assembled currently available microarray data on changes of 
gene expression levels in Arabidopsis thaliana in response 
to infection with various plant viruses: Cabbage leaf curl 
virus (CalCuV), Cauliflower mosaic virus (CAMV), Cucum- 
ber mosaic virus (CMV), Lettuce mosaic virus (LMV), Plum 
pox virus (PPV), Turnip crinkle virus (TCV), Tobacco etch
virus (TEV), Tobacco mosaic virus (TMV and TMV-Cg), 
Tobacco rattle virus (TRV), Turnip mosaic virus (TuMV) 
and Oilseed rape mosaic tobamovirus (ORMV) [7,8,10,16- 
22]. 

The total number of genes in the assembled experi- 
ments with significantly altered expression elicited by 
these viruses was 16816. Among them 8684 were upre- 
gulated and 8132 were downregulated (a threshold of at 
least 2-fold change in expression level). However, this 
number included many identical genes (with the same 
ID) recorded in different experiments. After subtraction 
of the repeated genes, a list of 7639 unique genes was 
obtained [see Additional file 2], which represents 23% of 
the whole Arabidopsis genome. These are the genes ei- 
ther needed for the host to defend itself against the virus 
or for the virus to re-arrange host cellular machinery for 
its own needs. More than two thirds of these genes 
(69%) were always upregulated and only 13% were always 
downregulated. A sizeable portion of the genes (17%) 
had differential expression in response to infection with 
different viruses (Figure 1). Thus, the total number of 
induced genes (5282) exceeds that of repressed genes 
(1056) more than five-fold in our reduced (unique IDs) 
database. Approximately 15.5% of responsive genes had 
previously been described as involved in plant defense. 
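
The split into always upregulated, always downregulated and differentially responding genes can be expressed as a small classification over the pooled records. The sketch below assumes a hypothetical input of (gene ID, direction) pairs pooled across experiments; the gene names are placeholders.

```python
def classify_direction(observations):
    """Label each gene 'always_up', 'always_down' or 'differential'.

    observations: iterable of (gene_id, direction) pairs pooled across
    experiments, where direction is 'up' or 'down'.
    """
    seen = {}
    for gene_id, direction in observations:
        seen.setdefault(gene_id, set()).add(direction)
    labels = {}
    for gene_id, directions in seen.items():
        if directions == {"up"}:
            labels[gene_id] = "always_up"
        elif directions == {"down"}:
            labels[gene_id] = "always_down"
        else:
            labels[gene_id] = "differential"  # up in some, down in others
    return labels

# Placeholder records: g1 is induced twice, g3 responds in both directions.
observations = [("g1", "up"), ("g1", "up"), ("g2", "down"),
                ("g3", "up"), ("g3", "down")]
labels = classify_direction(observations)
```

Counting the labels over the unique gene list gives the three groups shown in Figure 1.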

Figure 1 Venn diagram depicting the distribution of 7639
unique genes in response to infection with different viruses.
The yellow circle represents induced genes and the green circle
represents repressed genes. The radius of each circle corresponds to
the number of genes in the group of induced or repressed genes.

Does the larger number of activated genes as
compared to repressed genes correspond to a general trend
of plant response to virus, reflecting a greater diversity of
upregulated genes? In other words, based on this
information, can we conclude, as other authors have done
[23], that there is a widespread induction of the host's
biological processes due to the virus infection? In practice,
it depends on several conditions. First, individual data- 
bases available for different viruses differ extensively in 
the number of repressed or induced genes and combined 
analysis is greatly influenced by this ratio in the most 
comprehensive databases. Second, as mentioned above, 
the pool of upregulated genes is larger because of the 
greater variety of affected genes whereas the quantity of 
downregulated genes is limited. Otherwise stated, more 
diverse genes are upregulated during different virus 
infections whereas downregulated genes tend to be com- 
mon regardless of the particular virus. Lastly, a more 
accurate illustration of the general status of gene expres- 
sion changes can be derived from the analysis of their 
proportional representation in the sets of induced or 
repressed genes within each functional category, which is 
the subject of the following section. 

Using TAIR's functional categorization, we first assigned
each gene to one of the three main gene ontologies (GO):
Biological Process, Cellular Components and Molecular
Function, and next to a specific functional category (FC).
It is important to emphasize three key points when relying
on the GO terms in analyzing expression profiles: their
generality, their obvious redundancy and their
incompleteness. Redundant annotations and multiple
descriptions of the same biological mechanisms are of special
concern because they undermine efforts to characterize
gene products consistently. Still, the Gene Ontology
project [24] currently provides the most constructive
way to find functionally equivalent terms for the purpose
of classifying gene product properties.

Figure 2 and Table 1 show distributions of the respon- 
sive genes in different FC with respect to the total num- 
ber of genes in each given GO domain of the assembled 
viral database. Notably, some of the key functions
with a large number of affected genes (upregulated or
downregulated) are in the following categories:
chloroplast (21% of total genes in FC), nucleus (15%) and
cytosol (15%) in GO Cellular Components; hydrolase and
transferase activity (13% both), protein and DNA or 
RNA binding (16% and 10%, respectively) in GO Mo- 
lecular Function; protein metabolism (18%) and response 
to stress (16%) in GO Biological Process. 

It was essential to determine the extent to which
specific FC, each representing a group of genes implicated
in a particular biological mechanism, are involved in the
host reaction to infection. Therefore, we compared the
distribution of genes assigned to different FC across the
whole genome of Arabidopsis with the corresponding
distribution within our database of genes that are involved
in response to virus infection (Table 1) [see Additional
file 3]. Presumably, the greater the share a category
occupies in the virus database relative to the whole genome,
the more strongly this FC participates in the host response.
We found that the percentage of genes covered by several
categories, such as cell wall, cytosol, extracellular,
ribosome, and electron transport or energy pathways, was
about twice their share in the whole genome, thus
emphasizing the important role of these functions in
host-virus interactions. The share of genes in the FC
"response to stress" was also 1.8 times higher in the viral
database than in the whole genome (Table 1) [see
Additional file 3].
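
As a worked check of this comparison, the ratio for a category is its percentage in the virus database divided by its percentage in the corresponding GO domain of the whole genome. The sketch below recomputes two ratios from the gene counts listed in Table 1; the function name is an assumption introduced for the example.

```python
def share_ratio(n_virus, total_virus, n_genome, total_genome):
    """Ratio of a category's share in the virus database to its genome share."""
    pct_virus = 100 * n_virus / total_virus      # % of virus database
    pct_genome = 100 * n_genome / total_genome   # % of GO domain
    return pct_virus / pct_genome

# Counts from Table 1: cell wall (GO Cellular Component) and
# response to stress (GO Biological Process).
cell_wall = share_ratio(302, 6116, 605, 24956)
stress = share_ratio(1097, 7063, 2424, 28031)
```

A ratio above 1 means the category occupies a larger share of the virus-responsive genes than of the genome as a whole; a ratio below 1, as for the "unknown" categories, means it is under-represented.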

Common responses to different viruses 

To find common responses to different viruses, we com- 
pared patterns of gene response in individual susceptible 
interactions. In order to do this, we used the number of 
shared genes among every pair of viruses to compute a 




Figure 2 Distributions of the responsive genes by FC in each given GO domain of the assembled viral database.
Table 1 Distributions of genes in the three main GO domains for the whole genome and assembled viral database

Columns: number of genes in genome; % of GO domain; number of genes in virus database; % of virus database; ratio of % of virus database to % of GO domain.

Functional category                     Genome     % GO   Virus  % virus   Ratio
                                         genes   domain   genes       DB

GO Cellular Component
cell wall                                  605     2.42     302     4.94    2.04
chloroplast                               2941    11.78    1308    21.39    1.81
cytosol                                   1662     6.66     918    15.01    2.25
ER                                         425     1.70     200     3.27    1.92
extracellular                              454     1.82     233     3.81    2.09
Golgi apparatus                            249     1.00     109     1.78    1.79
mitochondria                              1123     4.50     425     6.95    1.54
nucleus                                   2504    10.03     936    15.30    1.53
other cellular components                 4405    17.65    1056    17.27    0.98
other cytoplasmic components              3384    13.56    1695    27.71    2.04
other intracellular components            4310    17.27    1914    31.29    1.81
other membranes                           3373    13.52    1358    22.20    1.64
plasma membrane                           1862     7.46     908    14.85    1.99
ribosome                                   472     1.89     232     3.79    2.01
unknown cellular components               9632    38.60    1024    16.74    0.43
Total                                    24956   100.00    6116   100.00

GO Molecular Function
DNA or RNA binding                        2894    10.59     693    10.03    0.95
hydrolase activity                        2959    10.82     913    13.22    1.22
kinase activity                           1342     4.91     426     6.17    1.26
nucleic acid binding                      1467     5.37     216     3.13    0.58
nucleotide binding                        2114     7.73     655     9.48    1.23
other binding                             4529    16.57    1258    18.21    1.10
other enzyme activity                     3200    11.70    1157    16.75    1.43
other molecular functions                 1003     3.67     311     4.50    1.23
protein binding                           2426     8.87    1090    15.78    1.78
receptor binding or activity               271     0.99      88     1.27    1.29
structural molecule activity               536     1.96     252     3.65    1.86
transcription factor activity             1681     6.15     450     6.52    1.06
transferase activity                      2509     9.18     870    12.60    1.37
transporter activity                      1266     4.63     420     6.08    1.31
unknown molecular functions              10851    39.69    1651    23.90    0.60
Total                                    27340   100.00    6907   100.00

GO Biological Process
cell organization and biogenesis          1245     4.44     440     6.23    1.40
developmental processes                   2309     8.24     758    10.73    1.30
DNA or RNA metabolism                      444     1.58     114     1.61    1.02
electron transport or energy pathways      294     1.05     151     2.14    2.04
other biological processes                2157     7.70     965    13.66    1.78
other cellular processes                 12254    43.72    3950    55.93    1.28
other metabolic processes                12875    45.93    4092    57.94    1.26
protein metabolism                        4256    15.18    1252    17.73    1.17
response to abiotic or biotic stimulus    2175     7.76    1037    14.68    1.89
response to stress                        2424     8.65    1097    15.53    1.80
signal transduction                       1366     4.87     431     6.10    1.25
transport                                 2080     7.42     726    10.28    1.39
unknown biological processes             11282    40.25    1800    25.48    0.63
Total                                    28031   100.00    7063   100.00




similarity matrix between them according to the formula
$S_{ij} = 2n_{ij}/(n_i + n_j)$, where $n_i$ and $n_j$ are the
numbers of genes with altered expression level belonging to
the database for virus $i$ and virus $j$, respectively, and
$n_{ij}$ represents the number of genes shared between both
viruses [9]. Next, we arranged the computed data in accordance
with the value of $S_{ij}$: the higher this value is, the more
similarity exists between the two compared virus-host
interactions.
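
The pairwise similarity can be computed directly from the sets of responsive gene IDs, as in the sketch below. The virus labels and gene names are placeholders, and the helper names are assumptions introduced for the example; only the formula itself comes from the text.

```python
from itertools import combinations

def similarity(genes_i, genes_j):
    """S_ij = 2 * n_ij / (n_i + n_j) for two sets of responsive gene IDs."""
    n_i, n_j = len(genes_i), len(genes_j)
    if n_i + n_j == 0:
        return 0.0  # two empty lists share nothing
    return 2 * len(genes_i & genes_j) / (n_i + n_j)

def similarity_matrix(responsive):
    """Return {(virus_a, virus_b): S} for every unordered pair of viruses."""
    return {
        (a, b): similarity(responsive[a], responsive[b])
        for a, b in combinations(sorted(responsive), 2)
    }

# Placeholder gene sets for three hypothetical virus data sets.
responsive = {
    "virus_A": {"g1", "g2", "g3", "g4"},
    "virus_B": {"g3", "g4", "g5"},
    "virus_C": {"g6"},
}
matrix = similarity_matrix(responsive)
# Arrange pairs from most to least similar.
ranked = sorted(matrix.items(), key=lambda item: item[1], reverse=True)
```

The value ranges from 0 (no shared genes) to 1 (identical gene lists), so pairs of data sets of very different size remain comparable.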

As presented in Figure 3, changes of the expression pat- 
tern in response to infection with the majority of viruses 
were similar to the ones associated with TMV and TMV- 
Cg. Responses to TEV and LMV potyviruses also showed 
significant similarity to each other. On the other hand, 
even responses to RNA versus DNA viruses can be quite 
similar as exemplified by CMV and CalCuV. 

Each virus-host interaction is unique in terms of which
genes have altered expression levels, and the number of
common genes affected by all viruses is very limited.
Among the shared genes are several pathogenesis-related (PR)
genes, although they seem to be specifically upregulated in
response to particular viruses: CalCuV, TMV and TEV (PR1), CMV,
CalCuV, TMV and TRV (PR2 and PR4), LMV and TRV
(PR5), CMV and TMV (PR3).

Overall, we have found only 198 genes that frequently 
change their expression in response to the majority of 
the viruses [see Additional file 4]. Those include genes 



participating in the defense or immune pathways as well 
as genes of catabolic and regulation processes. One of 
the most frequently induced genes in all interactions is 
AT5G38530 of the tryptophan biosynthetic pathway. The 
tryptophan pathway provides precursors for the synthesis
of key secondary metabolites, the auxin indole-3-acetic
acid (IAA), and other molecules that help protect
plants against pathogens and herbivores [25].

Proportional representation of different functional 
categories in the sets of induced or repressed genes 

To identify overrepresented functional groups among
activated or repressed genes, we subjected genes to
analysis by Babelomics 4 FatiGO [12] and SEA from agriGO
[13]. FatiGO uses Fisher's exact test for 2 x 2 contingency
tables to scan for significant over-representation of GO
terms in one set with respect to the other. Singular
enrichment analysis (SEA) identifies enriched GO terms in
a list of microarray probe sets or gene identifiers. Using
both FatiGO and SEA helps to obtain accurate, condensed
biological data by comparing a query list to a
background population from which it is derived [12].
That is, such analyses predict a role for certain
biological processes in the total response to infection rather
than merely counting the number of upregulated and
downregulated genes.
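
To make the enrichment test concrete, the sketch below computes a one-sided Fisher's exact P value for over-representation of a GO term in a query list relative to the background it was drawn from. It is an illustration of the underlying test only, not a reproduction of FatiGO or agriGO: the function name, argument names and toy numbers are assumptions, and no multiple-testing correction is applied.

```python
from math import comb

def fisher_overrepresentation(k, n, K, N):
    """One-sided Fisher's exact P value for a 2 x 2 table.

    k: query genes annotated with the GO term
    n: size of the query list
    K: background genes annotated with the term (query genes included)
    N: size of the background
    Returns P(X >= k) under the hypergeometric distribution.
    """
    upper = min(n, K)
    total = comb(N, n)
    # comb returns 0 when more genes are requested than are available.
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / total

# Toy numbers: 3 of 3 query genes carry a term found in 4 of 10 genes.
p_enriched = fisher_overrepresentation(k=3, n=3, K=4, N=10)
p_none = fisher_overrepresentation(k=0, n=3, K=4, N=10)
```

A small P value means the term occurs in the query list more often than its frequency in the background would predict, which is the sense in which a functional group is called overrepresented.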

Figure 3 Similarity matrix ($S_{ij}$) reflecting changes in the Arabidopsis transcriptome in response to different Arabidopsis-virus
interactions. The higher the value of $S_{ij}$, the more similarity exists between two compared virus-host interactions. Each cell represents an
individual virus-host interaction $S_{ij}$, color-delineated according to the level of similarity ranging from considerable (red) to weak (dark blue).

When implemented with the assembled data set,
FatiGO and SEA identified over-represented FC and sub- 
categories among the sets of induced or repressed genes 
belonging to each of the main GO domains. Overrepre- 
sented in the set of repressed genes were those involved 
in defense response, hormone signaling (JA, ABA), re- 
sponse to external stimulus, photosynthesis and bioener- 
getics processes (encoding photosystem I and II proteins 
and electron transport chain) [see Additional file 5]. 
Downregulation of these functions is presumably due to 
the virus overtaking host defense-related pathways and 
causing physiological changes associated with the disease 
symptoms [22]. 

Overrepresented in the set of induced genes were 
those participating in response to abiotic stimulus, 
responses to organic and inorganic substances, nitrogen 
component metabolic processes and protein transport 
(Golgi vesicle transport, protein targeting, and cytoskel- 
etal protein binding). The latter host pathways are essen- 
tial for facilitating virus intracellular movement. Another 
example of a biological process that was found only in 
the upregulated gene set is chromatin organization (his- 
tone modification). Chromatin structural features and 
posttranslational modifications play a crucial role in the 
regulation of gene expression [26]. Epigenetic "marks"
generated by modifications of histones and DNA are 
spread over vast regions of chromosomes and can be 
altered in response to stress. 

Assembling information from multiple sources that
differ in microarray platform, experimental conditions,
stage of infection at which samples were collected,
etc. raises a question about the integrity of the combined
data, since it is hardly possible to keep the "batch
effect" from influencing the final results. Even so, these
disparities are unlikely to obscure the main biological
trends. For
instance, as presented in Figure 3, the values for TMV 
and TMV-Cg (a crucifer-infecting strain of TMV) are 
very similar to each other even though they were 
obtained by different authors using different platforms in 
totally different environments. To take into account 
some of the determining factors (such as infection stages 
at the time of analyses), when different genes may be- 
come activated and/or repressed, we also looked into 
combined microarray data on early and late responses to 
virus infection and analyzed it as much as statistically 
possible. 

Applying agriGO tools, we found that early, non- 
symptomatic, phases of infection are characterized by 
massive induction of genes belonging to both common 
and stress-responsive pathways. Overrepresented in the 
set of activated genes were amine biosynthetic processes, 
aromatic amino acid family metabolic processes, photo- 
synthetic activity and responses to biotic and abiotic 
stresses (Figure 4). Late stages of pathogenesis, when 



plants are systemically infected, are characterized by re- 
pression of the majority of the stress-responsive path- 
ways activated at the early phases, such as response to 
abscisic acid stimulus, response to wounding, innate im- 
mune response, response to oxidative stress, response to 
auxin stimulus, callose deposition in cell wall and glyco- 
sinolate metabolic process. 

One of the main features of the late host response is 
repression of photosynthetic and energy pathways: 
photosystem I and II assembly, pentose-phosphate cycle, 
etc. (Figure 4). Chlorosis, the yellowing of normally green
plant tissue caused by disruption of chloroplast
structure and function and a decreased amount of
chlorophyll, is often a direct result of these deficiencies.
In contrast, cellular respiration, catabolic processes,
proteolysis and senescence are overrepresented in the 
induced gene set at the late stages of virus infection. 
Among the common characteristics of both the early and 
late responses is negative regulation of developmental pro- 
cesses. In essence, sets of host genes affected at the late 
stage of infection closely resemble the general picture of 
gene expression changes caused by viral pathogenesis. 

Pathogenesis-related two- to four-gene clusters in the 
genome of Arabidopsis thaliana 

While looking at the data derived from the analysis of 
publicly available microarray repositories, we noticed 
that genes with expression profiles modified as a result 
of viral infection were often (12% of genome) located in 
close proximity to each other on the same chromosomes. 
Analyzing only close proximity and differential response 
to the infection (repressed or activated genes) we discov- 
ered 1594 such groups of genes (Figure 5A). Among 
them were 5 groups consisting of 8 genes, 7 groups of 7 
genes, 20 groups of 6 genes, and 35 groups of 5 genes. 
Assuming that the order of genes with altered expression 
patterns along the chromosomes is not accidental [27] 
but reflects their functional role, we hypothesized that 
these groups of neighboring genes distributed across the 
Arabidopsis genome may further be divided into co- 
regulated and co-expressed blocks of genes or clusters. 
Since microarray data sets on susceptible host-virus inter- 
actions were not large enough to statistically predict clus- 
ters of genes with similar expression changes, we 
combined them with data from microarray experiments 
representing bacterial-host and fungal-host interactions 
and then used "viral sets" as a base for filtering out only 
analogous genes. This way, we were able to compose 
groups of co-expressed genes. 

Figure 4 Overrepresented GO terms in induced and repressed gene sets in the early and late stages of infection. The ontology table
displays significant GO categories in response to infection with a compatible virus as determined by Babelomics 4 FatiGO and agriGO SEA. Box
color reflects the false discovery rate (FDR) to which the given GO category belongs, as shown on the scale.

We found 207 neighboring co-expressed genes which fall
into 98 clusters under conditions of pathogenesis [see
Additional file 6]. These clusters consist of groups of physically
linked and functionally related genes (response to
pathogen) that are co-expressed (correlation coefficient r > 0.7)
and possibly co-regulated but share no sequence homology.
Although most of them were differentially expressed, 22 
clusters were always upregulated and only 2 clusters were 
always downregulated. Among all identified clusters only 
two contained four genes, nine were composed of three 
genes, and eighty-six contained two genes (Figure 5B). 
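
One simple way to turn neighboring-pair correlations into clusters is to chain consecutive genes for as long as each adjacent pair exceeds the threshold. The sketch below follows that rule; it is an assumption made for illustration (the grouping criterion actually used may differ, for example by requiring all pairs within a cluster to correlate), and the gene names and r values are placeholders.

```python
from collections import Counter

def find_clusters(genes, neighbor_r, threshold=0.7):
    """Chain consecutive genes whose neighboring-pair r exceeds threshold.

    genes: gene IDs in chromosomal order.
    neighbor_r: neighbor_r[i] is Pearson's r between genes[i] and genes[i + 1].
    Returns a list of clusters, each a list of two or more gene IDs.
    """
    clusters = []
    current = genes[:1]
    for gene, r in zip(genes[1:], neighbor_r):
        if r > threshold:
            current.append(gene)          # extend the running cluster
        else:
            if len(current) > 1:
                clusters.append(current)  # close a cluster of 2+ genes
            current = [gene]
    if len(current) > 1:
        clusters.append(current)
    return clusters

# Placeholder chromosome segment with seven genes.
genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
neighbor_r = [0.9, 0.8, 0.1, 0.75, -0.2, 0.3]
clusters = find_clusters(genes, neighbor_r)
size_distribution = Counter(len(c) for c in clusters)
```

Applying the same function to gene orders shuffled at random gives the chance distribution of cluster sizes against which the observed clusters are compared.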

Differences between the stochastic distribution and the 
actual distribution revealed in this experiment were 
observed both in the number of genes included in clus- 
ters and in cluster size (Figure 5C). The number of two- 
gene and three-gene clusters in the experiment was al- 
most 2.5 times and 4 times higher, respectively, than 
expected by chance. Four-gene clusters were obtained in 
the experiment only; clusters of this size were not pre- 
dicted to form by chance. We found 16 overlapping 



clusters between our pathogen-response clusters and 
those predicted by Zhan et al. using microarray data 
representing 128 experimental conditions [28]. Appar- 
ently, genes forming these clusters are broadly co- 
expressed in a wide range of conditions. 

To find out if there are any functional relationships
between locally co-expressed genes, we used TAIR's GO
for Arabidopsis. We found 7 molecular functions that
are shared within each of 8 gene pairs, as well as 13 cellular
components that are common to 23 gene pairs. As
revealed by the AraCyc database files from the Plant
Metabolic Network, in none of our clusters do the genes
belong to the same pathway [see Additional file 6].

Figure 5 Clusters of non-homologous, defense-related genes in the genome of A. thaliana. (A) Chromosomal distribution of 1594 groups
of neighboring genes, which responded differentially (repressed or activated) to various viral infections. (B) Distribution of pathogen-related,
co-expressed clusters on the chromosomes of A. thaliana. (C) Comparison between the number of clustered genes obtained from the random
distribution data set or found experimentally. The height of the bars represents the number of defense-related gene clusters of the corresponding
size either revealed in the experiment (black columns) or estimated by stochastic distribution (unfilled columns).

Therefore, co-expressed neighbors do not seem to be
associated with a particular GO [29]. However, it does
not mean that clustered genes are not functionally
related to each other. That is, in spite of belonging to 
different GO categories, co-expressed groups of genes 
are affiliated with the same function - stress response. 
Plants re-arrange their metabolism upon recognition of 
pathogen-associated molecular patterns (PAMP) so that 
genes of different GO categories that are involved in 
defense mechanisms are engaged [30]. 

Interestingly, one of the clusters includes three genes 
encoding leucine-rich repeat (LRR) family proteins: 
AT1G33590, AT1G33600 and AT1G33610. Two more 
genes that are absent in available microarray data sets, 
AT1G33612 and AT1G33670, also encode LRR proteins 
and are located in the same chromosomal region. In 
addition, we found three other clusters containing genes 
with common domain structure and functional character- 
istics: i. cluster with a Toll-Interleukin-Resistance (TIR) 
domain (AT1G72900 and AT1G72910); ii. cluster with 
genes encoding SAM superfamily proteins (S-adenosyl- 
L-methionine-dependent methyltransferases superfamily, 
AT4G00740, and AT4G00750); and iii. cluster with genes 
encoding histone superfamily proteins (AT4G40030 and 
AT4G40040). 

Previously, we reported on the clustering of pathogen- 
response genes in the genome of Arabidopsis thaliana 
[31]. That study was based on the profiling of EST data- 
bases derived from different plant species infected with 
fungi, bacteria, and viruses [32]. While comparing gene 
clusters revealed by broad EST mining with analysis of 
microarray data sets specific for compatible virus-host 
interactions (this investigation), we found that most 
groups of neighboring genes determined in the former 
study could be included in the clusters identified in this 
work, provided that both up- and downregulated genes 
derived from different experiments are counted 
(Figure 5A). However, if only co-expressed genes are 
considered (Figure 5B), the overlap between these two data 
sets is quite low, which could possibly be explained by a 
unique pattern of chromosomal gene clustering characteristic 
of different pathogen types and/or individual pathogens. 
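
The distinction drawn above, between counting all neighboring responsive genes and counting only co-expressed neighbors, can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the function name `find_clusters`, the representation of genes as (rank along the chromosome, direction) pairs, and the requirement of strict adjacency are all assumptions made for illustration.

```python
def find_clusters(genes, same_direction=False, min_size=2):
    """Group responsive genes that are immediate chromosomal neighbors.

    genes: list of (rank, direction) tuples for ONE chromosome, where rank is
           the gene's ordinal position along the chromosome (hypothetical
           encoding) and direction is "up" or "down". Only responsive genes
           are listed; non-responsive genes are simply absent.
    same_direction: if True, neighbors must also share the same direction
           (a crude stand-in for "co-expressed").
    """
    clusters, current = [], []
    for rank, direction in sorted(genes):
        if current:
            prev_rank, prev_dir = current[-1]
            adjacent = rank == prev_rank + 1
            compatible = (not same_direction) or direction == prev_dir
            if adjacent and compatible:
                current.append((rank, direction))
                continue
            # The run is broken: keep it only if it is large enough.
            if len(current) >= min_size:
                clusters.append(current)
        current = [(rank, direction)]
    if len(current) >= min_size:
        clusters.append(current)
    return clusters


# Toy data, not taken from any data set in this study.
toy = [(1, "up"), (2, "down"), (3, "up"), (7, "up"), (8, "up"), (12, "down")]
mixed = find_clusters(toy)                              # up and down counted together
coexpressed = find_clusters(toy, same_direction=True)   # same direction only
```

With the toy input, relaxing the direction requirement merges ranks 1 to 3 into one cluster, whereas the stricter setting keeps only the pair at ranks 7 and 8, which mirrors why the overlap between data sets depends on how clusters are defined.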




Figure 6 Venn diagram depicting similarity between responses to viral, bacterial and fungal pathogens. The numbers in 
parentheses represent the number of unique genes responsive to a particular type of pathogen. 
Common genes participate in virus-, fungi- and bacteria-host 
interaction 

As mentioned above, we tried to assemble the largest 
possible database of Arabidopsis genes responsive to viral 
infections using currently available microarray data. For 
comparison, we also put together genes derived from 
microarray experiments with bacteria and fungi. Since 
there is significantly more information on changes in 
plant gene expression due to infection with bacteria and 
fungi, we limited the number of genes to approximately 
the same number as was compiled for viruses: 17734 for 
bacteria (11409 unique genes) and 15426 for fungi 
(among them 11047 unique genes) [11,33-36]. 

Gene expression changes in response to all pathogens 
were very similar. In spite of specific interactions between 
host plants and each of the pathogens, nearly half of the 
genes associated with viral infections in susceptible hosts 
were also involved in response to bacterial or fungal 
infections (Figure 6). Next, we selected genes which were 
induced or repressed in all three types of interactions [see 
Additional file 7]. Most of these genes belong to co- 
expressed chromosomal regions, or clusters: there were 79 
two-gene clusters and 4 three-gene clusters. This suggests 
that common genes participating in response to biotic 
stress may be co-regulated and organized in clusters. A 
small cluster containing non-homologous genes 
AT1G20100 and AT1G20110 is especially interesting since 
it is engaged in response to the majority of plant viruses. 
One of these genes encodes a RING/FYVE/PHD zinc finger 
superfamily protein that participates in signal transduction 
pathways, and the other encodes a protein of unknown 
function. 
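
The three-way comparison summarized in Figure 6 amounts to set operations on lists of gene identifiers. The following Python sketch shows one way such regions and the shared fraction could be computed; the function names `venn3` and `shared_fraction` and the placeholder identifiers are hypothetical and do not reproduce the data sets or software used here.

```python
def venn3(virus, bacteria, fungi):
    """Return the seven regions of a three-set Venn diagram as sets."""
    v, b, f = set(virus), set(bacteria), set(fungi)
    return {
        "virus_only": v - b - f,
        "bacteria_only": b - v - f,
        "fungi_only": f - v - b,
        "virus_bacteria": (v & b) - f,
        "virus_fungi": (v & f) - b,
        "bacteria_fungi": (b & f) - v,
        "all_three": v & b & f,
    }


def shared_fraction(virus, bacteria, fungi):
    """Fraction of virus-responsive genes that also respond to bacteria or fungi."""
    v = set(virus)
    shared = v & (set(bacteria) | set(fungi))
    return len(shared) / len(v) if v else 0.0


# Placeholder identifiers, not real loci.
virus = {"g1", "g2", "g3", "g4"}
bacteria = {"g2", "g3", "g5"}
fungi = {"g3", "g4", "g6"}

regions = venn3(virus, bacteria, fungi)
fraction = shared_fraction(virus, bacteria, fungi)
```

Working with sets of unique identifiers rather than raw probe lists matters here, because the totals quoted above distinguish all compiled entries from unique genes.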

Figure 7 Scatter plot of over-represented GO terms from upregulated gene sets of viral, bacterial and fungal pathogens. GO terms were 
obtained using the SEA tool from agriGO and data were visualized by REViGO. The axes have no intrinsic meaning; the guiding principle is that 
semantically similar GO terms remain close together in the plot. Color scale indicates log10 p-value (red is higher and blue is lower). Disc size is 
proportional to the log number of genes in the category. 

While determining significantly over-represented functional 
groups that include genes activated during virus-host 
interactions, we have also found groups of genes 
whose biological functions are common for all three 
pathogens (virus, bacteria, and fungus) regardless of 
whether there is a susceptible or resistant type of 
interaction. To obtain significantly over-represented GO 
terms from gene sets upregulated during infection with 
these pathogens, we used the SEA tool from agriGO 
[13]. To visualize the data, we used REViGO [14]. 
We found that some of the GO categories, such as sulfur 
metabolism, cellular aromatic compound metabolism, 
and cell wall modification, were enriched in the 
downregulated genes under compatible virus infections. 
However, when we considered resistant-type interactions of 
bacteria and fungi with host plants, we found that the 
same GO categories were enriched with activated genes 
(Figure 7). In addition, when genes involved in general 
immune responses were analyzed using SEA and 
REViGO in both susceptible and resistant types of 
interactions, they were also found to be induced. 
Unfortunately, the limited amount of microarray data did not 
allow a full-scale comparison between susceptible and 
resistant types of interactions, which would be useful in 
terms of understanding the mechanisms of R gene- 
mediated resistance. 
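
To make the notion of an over-represented GO term concrete, the sketch below computes a one-sided hypergeometric p-value for a single term. This is an assumption for illustration only: the text does not state which statistical test or multiple-testing correction was selected in agriGO, and the function name `hypergeom_enrichment_p` and its arguments are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided enrichment p-value P(X >= k), X ~ Hypergeometric(N, K, n).

    N: number of annotated genes in the background (e.g. the genome)
    K: background genes annotated with the GO term
    n: genes in the query list (e.g. upregulated genes)
    k: query genes annotated with the GO term
    """
    upper = min(n, K)
    total = comb(N, n)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible draws contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Small worked case: 10 background genes, 4 carry the term,
# 3 genes in the query list, 2 of which carry the term.
p = hypergeom_enrichment_p(k=2, n=3, K=4, N=10)
```

In practice a p-value like this would be computed for every GO term and then corrected for multiple testing before terms are passed to a visualization step.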

Conclusions 

We have assembled currently available microarray data 
on changes in gene expression levels in compatible 
Arabidopsis-virus interactions. In summary, we found that 
there is a greater variety of upregulated genes in the 
course of viral pathogenesis as compared to repressed 
genes. Distribution of the responsive genes in combined 
viral databases differed from that of the whole Arabidopsis 
genome, thus underlining a role of the specific FC in com- 
mon mechanisms of general resistance against viruses and 
in physiological/cellular changes caused by infection. Using 
integrative platforms for the analysis of gene expression data 
and functional profiling, we identified overrepresented func- 
tional groups among activated and repressed genes, which 
provided an in-depth view of the role of certain biological 
processes in response to infection. Each virus-host inter- 
action was found to be unique in terms of the genes with 
altered expression levels, and the number of common genes 
affected by all viruses was very limited. We discovered that 
genes with expression profiles modified as a result of viral 
infection were often located in close proximity to each other 
on the same chromosomes, forming multiple clusters 
consisting of physically linked and functionally related 
genes. Finally, combining genes derived from microarray 
experiments with bacteria and fungi with a viral data set, we 
observed that gene expression changes in response to all 
pathogens were very similar and that nearly half of the 
genes associated with viral infections in susceptible hosts 
were also involved in response to bacterial or fungal 
infections. 